Reliability and predictability of phenotype information from functional connectivity in large imaging datasets

One of the central objectives of contemporary neuroimaging research is to create predictive models that can disentangle the connection between patterns of functional connectivity across the entire brain and various behavioral traits. Previous studies have shown that models trained to predict behavioral features from the individual’s functional connectivity have modest to poor performance. In this study, we trained models that predict observable individual traits (phenotypes) and their corresponding singular value decomposition (SVD) representations – herein referred to as latent phenotypes from resting state functional connectivity. For this task, we predicted phenotypes in two large neuroimaging datasets: the Human Connectome Project (HCP) and the Philadelphia Neurodevelopmental Cohort (PNC). We illustrate the importance of regressing out confounds, which could significantly influence phenotype prediction. Our findings reveal that both phenotypes and their corresponding latent phenotypes yield similar predictive performance. Interestingly, only the first five latent phenotypes were reliably identified, and using just these reliable phenotypes for predicting phenotypes yielded a similar performance to using all latent phenotypes. This suggests that the predictable information is present in the first latent phenotypes, allowing the remainder to be filtered out without any harm in performance. This study sheds light on the intricate relationship between functional connectivity and the predictability and reliability of phenotypic information, with potential implications for enhancing predictive modeling in the realm of neuroimaging research.


Introduction
One of the central objectives of contemporary neuroscience is to elucidate a connection between brain structure/function and behavior.This connection could provide clinically relevant metrics, and be used for prognosis and treatment decisions.In order to identify such connections, many neuroimaging studies have focused on predicting observable characteristics or behavior -broadly, phenotypes -from brain measurements.Despite the popularity of this question, most studies report a small correlation of r = 0.1−0.4 between the phenotypes and their predictions from brain measurements; moreover, the prediction performance for similar or related phenotype measures may vary substantially across studies, (Greene and Constable, 2023;Sui et al., 2020;Poldrack et al., 2020;Pervaiz et al., 2020;Genon et al., 2022).
The situation described above can be attributed to several factors, for instance intrinsic variations of scanning equipment, noise at various stages of acquisition, or natural and pathological differences between study participants, among others.Moreover, the effect of these factors can be exacerbated by variations in sample size across studies.In general, the use of machine learning methods in typical clinical neuroimaging datasets is severely constrained by their size (tens to low hundreds of subjects), in comparison with their dimensionality (tens of thousands of features).Determining how much data is needed to train models to predict brain-behavior associations is still an open question.Recent research by Marek et al. (2022) suggested that at least a thousand subjects are needed to accurately measure univariate brainbehavior associations, and that small studies fail to replicate and produce inflated effect sizes.This controversial study prompted an avalanche of responses showing that the brain-behavior relationships could be predicted from smaller datasets by using multivariate models and hold-out samples (Spisak et al., 2023) and thoughtfully choosing the brain data and behavioral targets of prediction (Cecchetti and Handjaras, 2022;Gratton et al., 2022;Rosenberg and Finn, 2022).
In addition, recent studies (Nikolaidis et al., 2022;Gell et al., 2023) argue that increasing sample size might improve the prediction performance, but not be sufficient for consistent results.They argue that latent phenotypes from functional connectivity to the performance of models predicting the original phenotypes from functional connectivity.
Since the reliability of the predicted variable limits the maximal observable predictive performance, we also evaluated the reliability of the latent phenotypes across different participant samples, and the relationship between reliability and predictability from functional connectivity.Finally, we tested whether the phenotype information predictable from functional connectivity is primarily contained in reliable latent phenotypes.To this effect, we reconstructed phenotypes from the subset of reliable latent variables, and evaluated the performance of models trained to predict these reconstructed phenotypes from functional connectivity.Overall, our study provides insights into the relationship between prediction accuracy and reliability and extends our understanding of the extent to which latent, and reconstructed phenotypes can be predicted from resting state fMRI.

Datasets
In the analyses described in this paper, we used resting state fMRI data and phenotype information from two datasets: the Human Connectome project (Van Essen et al., 2012) and Philadelphia Neurodevelopmental Cohort (Satterthwaite et al., 2016).We will focus on predicting phenotypes only from functional connectivity (FC) measures derived from the fMRI data, as prediction models derived from FC outperform other modalities in general (Ooi et al., 2022).We provide dataset-specific details in the following subsections, and describe the common extraction of FC matrix after that.The code used for the analysis, together with a list of all subjects used for training, validation, and hold-out, can be found at the following repo: https://github.com/JessyD/bblocks-phenotypes.Our work relies on the recently released "dataset-phenotypes" tool4 , which outputs BIDS tabular phenotypic data dictionaries and transforms tabular phenotypic data to BIDS TSVs for common neuroimaging datasets.
Subjects were chosen if their imaging preprocessing pipeline was completed without error.While for PNC, we removed frames that were considered as high-motion using their framewise displacement and DVARS (as described below), we did not use quality control metrics to select HCP subjects.Subjects were split into a training set (N=856; 385 males/471 females aged 28.88 ± 3.70 years), a validation set (N=108; 45 males/63 females aged 28.16 ± 3.50 years), and a separate hold-out set (N=107; 55 males/52 females aged 28.69 ± 3.96).The hold-out set was not used for the results reported in this paper, and is being kept for a future pre-registered analysis.Subjects were assigned to the training, validation, and hold-out sets in a two-stage procedure.First, all families with two or more siblings were included in the training set.This was done to prevent leakage of information between twins or other siblings, which could happen if they were split between training and the other two datasets.Second, all subjects were assigned at random to the three datasets, until the specified number of participants was obtained in each.
Imaging Data We used the cleaned version of the HCP S1200 release, and used the grayordinate restingstate functional MRI data processed with ICA-FIX and MSM-All provided by Glasser et al. (2013a), a pipeline which was developed and optimized for this dataset.
Behavioral Phenotypes We used 83 phenotypes scores (Appendix A1 includes a full list of all scores used), which span across behavioral domains of cognition and personality and some additional variables that we included to be consistent with prior work.To facilitate comparisons, we included all behavioral measures previously used by Ooi et al. (2022), Kong et al. (2019) and He et al. (2020).All behavioral scores were individually z-scored across participants, and outliers that were above and below 3 standard deviations were treated as missing data as described below.We incorporated behavioral phenotypes categorized as "age-adjusted behaviors", which had undergone prior age adjustments by the HCP team, alongside the raw, unadjusted behaviors.The difference between "age-adjusted behaviors" and "unadjusted" behavior is that, while the Unadjusted Scale Score evaluates a participant's performance compared to the entire NIH Toolbox Normative Sample, the Age-Adjusted Scale Score assesses a participant's performance in the context of a specific age group within the Toolbox Norming Sample (e.g., 18-29 or 30-35) (Slotkin et al., 2012).Both the outliers and the missing data were imputed using the IterativeImputer function from scikit-learn (Pedregosa et al., 2011), which models each feature as a function of other features, and uses the model to predict its missing values, iterating over features in a round-robin fashion.To make sure that we keep the development and validation sets completely separate, we trained the imputation function on the development set and applied it to the validation set.
Philadelphia Neurodevelopmental Cohort (PNC) As described by Satterthwaite et al. (2016), the PNC dataset was acquired using a 3T Siemens TIM Trio, with a 32-channel head coil with TR=3000 ms, TE=32 ms, flip angle=90 deg, BW=2056, FOV=192 x 192 mm, voxel resolution of 3x3x3mm with 46 slices.For our analysis, we used 1082 subjects.Similar to the HCP dataset, we split the subjects into a development set (N=864; 404 males/460 females aged 15.670 ± 3.376 years), validation set (N=109; 48 males / 61 females aged 16.231 ± 3.364), and an unused hold-out (N=109, 47 males/ 62 females aged 15.97 ± 3.38) dataset reserved for future pre-registered analyses.Subjects were assigned to the training, validation, and hold-out sets in the same two-stage process used in the HCP data.
Imaging Data We preprocessed the PNC dataset using fmriprep version 21.0.2 (Esteban et al., 2019) (additional information about all software used can be found in Appendix A.5.1).For motion censoring, time points were considered individually.If a time point exceeded 0.2 mm frame-wise displacement or a derivative root mean square (DVARS) above 75, it was marked as a point to be censored.Intervals of less than five points between censor points were also censored.The six estimated head-motion parameters, their derivatives, the average signal within the anatomically-derived white matter, and cerebrospinal fluid masks obtained from fmriprep were used as nuisance variables and regressed from the fMRI signal before we computed the FC.
Behavioral Phenotypes Although the PNC dataset includes surveys, cognitive and clinical phenotypes, for this analysis we did not use surveys or clinical phenotypes.We selected 39 summary scores from cognitive tasks from the PNC dataset (Satterthwaite et al., 2016;Gur et al., 2010;Roalf et al., 2014).See Appendix A. 5.3 for a full list of all scores used.Similar to the procedure used for the HCP dataset, missing values and outliers were imputed using the IterativeImputer function from scikit-learn, and each measure was z-scored across participants.

Generation of datasets for prediction experiments
We constructed functional connectivity matrices by computing the Pearson correlations between the average time series for each pair of brain regions.Regions were defined using the 17-network Schaefer 400 parcellation (Schaefer et al., 2018).If more than one resting state session was available for a subject, we averaged the resulting functional connectivity matrices across sessions and used the resulting one in our analyses.If any of the runs had more than 50% of the runs flagged as time points to be censored, it was excluded from the computation of functional connectivity.Across this paper, we will refer to the data matrix X n×d ), where n is the number of subjects available and d the number of pairwise connections after vectorizing the lower diagonal (in our case 79800).These pairwise connections will be the input features for the prediction model described in Section 2.5.The process is illustrated in Figure 1.

Dimensionality reduction of phenotype data
Linear dimensionality reduction The following paragraphs compare different types of linear dimensionality reduction.They introduce those methods and compare them to SVD, the method we used in further analysis.
In our experiments, we structure phenotype data as a matrix with n rows (participants) and d columns (phenotype measurements).Each row is therefore a d-dimensional observation for a participant.A linear dimensionality reduction method expresses that matrix as a product of a n × k latent variable matrix, and a k × d matrix that expresses observation vectors as a linear combination of latent variables (often called the mixing or loading matrix).The latent variable matrix can be viewed as a reduced dimensionality representation of the original matrix.The underlying assumption behind this framing is that Fig. 1.Schematic overview of how the data matrix is constructed.First, the brain activity is parcellated using an atlas of choice (here we used the Schaefer 400 parcellation).The functional connectivity matrix is created by computing the Pearson correlation between the average resting-state fMRI time series in each pair of regions.Because the FC matrix is symmetric between its upper and lower triangular entries, we flatten only the lower triangular entries of the functional connectivity matrix.
each observation does not vary independently in d different ways.Instead, those d measures are driven by one or more latent variables, and that relationship holds across participants.If measurements were the summary scores of different instruments, the latent variables could correspond to factors that drive variance across them; in that case, latent variables would be helpful to elicit relationships between instruments.Aside from their role in psychology research, latent variables are potentially useful as prediction targets from imaging, as discussed earlier.This is because they are estimated from multiple phenotype measures they are associated with, and hence, are potentially less noisy.Factorizations can also be used for denoising, in that the product of the latent variable matrix and the mixing matrix can be viewed as an approximation that tries to keep the essential characteristics of the data.This is usually done by considering fewer than the maximum number of latent variables that can be estimated from the data matrix.
Linear dimensionality reduction methods There are many linear dimensionality reduction methods.They differ primarily in their assumptions about the relationship between latent variables, and their connection to the measured variables.While it is beyond the scope of this paper to review all the methods available, we briefly do so for those that have been used for experiments similar to those we carry out.Factor Analysis (FA) finds latent variables (factors) and a mixing matrix (loadings) such that between factor correlation is minimized.FA is used in the development of psychometric questionnaires, where the goal is to relate answers to questions to hypothesized latent constructs driving behavior.FA usually includes a step called varimax rotation (Kaiser, 1958), where the goal is to further transform the factor/loading relationship so that each factor loads on as few phenotype measurements as possible, to facilitate interpretability.
Independent Component Analysis (ICA) finds latent variables (components) and a mixing matrix such that the factors are statistically independent, not orthogonal, but uncorrelated.Neither FA nor ICA have an intrinsic criterion for choosing the number of dimensions k analogous to the percentage of variance explained in PCA and SVD.
Principal Component Analysis (PCA) finds successive orthogonal projection directions of the data that maximize variance after projection.Each latent variable is defined as the projection of all the data points into one dimension; the mixing matrix is derived as part of the process of finding the projection directions.The intuition for this approach is that important latent variables should drive a lot of variance in the observations.Similar to FA, a varimax rotation can also be applied to a PCA to simplify the interpretation of the components.
Singular Value Decomposition (SVD), the technique we will use in this paper, finds the same projection directions as PCA if the inputs to the SVD are mean-centered.Both PCA and SVD share a useful characteristic that other methods do not have: the percentage of variance explained provides an order of importance of the latent variables.In addition, SVD has a number of practical advantages over PCA (e.g., it does not require computation of a covariance matrix), as well as useful theoretical properties.The one that is most important for this paper is that the data reconstruction through SVD, using k latent variables, is the best rank k approximation of the data, in the least squares sense.We provide an illustrated introduction to SVD, its mathematical properties, and its relationship with PCA in Section 5.2 of the Appendix.Finally, there are methods such as reduced rank regression that implicitly perform an SVD/PCA of a dataset of target variables, as part of a multivariate, multiple regression model.Given that the results are similar to those obtained by computing an SVD of the target variables, predicting each latent variable independently, and reconstructing predictions of the targets, we do not consider these further.

Controlling for age/sex
Before training the prediction models we regressed out age and sex at birth terms from each phenotype measure in both datasets: age, sex, sex 2 , and the interactions age × sex and age 2 × sex.Sex at birth was coded as a 0/1 binary variable and age was z-scored.The phenotype measure values used in the prediction experiments were the residual values after fitting this regression model.The purpose of regressing age and sex information out of the phenotype targets was to determine the degree to which their predictability was a combination of a) their being predictable from age/sex (e.g. a developmental phenotype measure) and b) age/sex being predictable from the imaging data.We deliberately did not regress age/sex information out of the imaging data.Our plan is to train neural networks that predict age/sex, in tandem with other phenotypes where age/sex have been regressed out.For that application, age/sex should not be regressed out of the imaging data.As these experiments are meant to produce baseline results using linear models, against which neural network results can be compared, we opted not to regress age/sex out of the imaging data in this case as well.
We also conducted a paired two-sided t-test to determine if there was a significant difference in performance before and after adjusting for age and sex.Before running the t-test, we calculated the correlations between the actual and predicted values for models both with and without the regression adjustments.We then applied the Fisher z-transformation to these correlations and calculated the average across multiple repetitions.

Prediction experiments
Prediction model We used a Ridge regression model to find the linear relationship between the imaging data and each of the selected behavioral phenotypes, or the latent variables derived from them.We chose a model with L 2 -regularisation, as implemented by the scikit-learn library (Pedregosa et al., 2011), version 1.2.2, as it can handle ill-posed problems that result from extremely correlated features.The reason for only using a linear prediction model in this work is that our primary goal is to obtain baseline results for future nonlinear models.

Experimental setup
The predictive results reported reflect the mean and standard deviation -which is an estimate of the standard error of the mean -across 100 bootstrap samples, taking age, gender and family information into account.In this procedure, we first split the training data into a part that would be resampled (80% of the total dataset) and a part that would be used to evaluate the performance of models with different regularization parameters, which was kept fixed (10% of the total dataset).The remaining 10% was kept as a hold-out dataset and was not used in this study.We did not use nested cross-validation.For each resampled training set, we found the optimal setting of the l 2 -regularisation term (α) using a grid search over the search space [10 2 , 10 7 ], applying the model to the fixed part of the dataset used for evaluation.The model with the best l 2 -regularisation term was then applied to the validation set, yielding one result of the 100 in the sample for predicting that particular phenotype measure.We performed the SVD of the phenotypes within each resampled training set.These were zscored in a column-wise fashion, i.e., we computed the mean and standard deviation across the training set to normalize that specific phenotype, so that the features with large means or variance would not dominate the results, and repeated this procedure for all phenotypes.We then used the mean and standard deviation from the training dataset to z-score the phenotype measures in the validation set.

Model evaluation
The performance of each model was evaluated using the coefficient of determination (r 2 ) and Pearson's correlation between the predicted measure and the actual measure in the validation set.While Pearson's correlation assumes values between -1 and 1, the r 2 can assume values between (−∞, 1] where 1 corresponds to the best possible score, as it is being computed on a separate dataset. We used the Autorank library (Herbold, 2020;Demšar, 2006) to evaluate if there was any statistical difference in the performance of the prediction algorithms using all of the components or a subset thereof.The Autorank library first assesses the normality and heteroscedasticity of the data before selecting the most suitable group level and post-hoc test to determine differences between the groups.The family-wise significance level for these tests was set at α = 0.05.We provided inputs to the Autorank library by calculating the correlation between the predictions and the actual values, applying the Fisher-z transformation to these correlations, and then computing the average correlation across multiple bootstraps.

Comparing predictive performance on original phenotypes and latent phenotypes
The latent phenotypes are composed of linear combinations of the original phenotypes, so directly comparing predictive performance between the two is not straightforward.Instead, we computed absolute difference between z-scored values of the prediction and true phenotypes (i.e., the error) for the component with the highest prediction, taking the average over all bootstraps.We also computed the same difference for the best performing latent phenotype.We then used a paired t-test to test the difference in error between the best performing latent phenotype and the best performing original phenotype.

Latent phenotypes reliability analysis
We conducted analyses to assess the reliability of the latent phenotypes produced by the SVD separately in the HCP and PNC datasets.Specifically, we wanted to determine how similar each latent phenotype would be if the SVD was conducted in distinct sets of subjects with the same phenotypic measures.We accomplished this by combining the training and validation subsets and then randomly splitting these combined datasets into two halves (without replacement) 1000 times.Splits were applied at the family level so that all members of a family were in the same subsample in every splitting.Since the latent phenotypes produced by the SVD are ordered by percent variance explained, the order may differ between samples.To ensure that the order was consistent across all subsample, we first ran an SVD on the combined dataset, saving the latent phenotypes.Then the latent variables from each subsample were reordered to match the order of latent phenotypes in the combined dataset.Components were matched via the Gale-Shapley stable marriage algorithm (Gale and Shapley, 1962) using the correlation between latent phenotype weights as the preferences.Once the latent phenotypes from each subsample within a split were aligned, we took the correlation between the latent phenotype's weights on the features (following the SVD notation, this would be V T ) as our measure of latent phenotype reliability.In all cases age and sex were regressed out of the phenotypes, values more than three standard deviations from the mean were censored, and missing values were imputed as described above prior to running the SVDs.These preprocessing steps were carried out separately in each subsample.Empirical 95% confidence intervals were determined from the distribution of inter-subsample correlations across splits.Correlation coefficients were Fisher z-transformed prior to averaging across splits and then transformed back for plotting/analysis.

Predictability of phenotype measures from functional connectivity
The goal of this experiment was to ascertain the degree to which each phenotype measure is predictable from functional connectivity, in both HCP and PNC.
As described in Section 2.5, we trained ridge regression models to predict each phenotype measure across 100 bootstrap resamplings; we then used them to generate predictions for participants in the validation set, which was fixed rather than resampled.This resampling scheme allows us to estimate the variability in performance across potential training sets for the fixed validation set.We averaged prediction results across resamplings, and used those results to estimate the standard deviation of the mean estimate.Figure 2.a and Figure 3.a show the average correlation between the phenotype predictions and their true values, with the measures sorted by predictability.We can see that for the HCP data the most predictable phenotype is Strength Unadjusted, which measures the Grip Strength (Figure 2a).Because of the relation between strength and sex at birth, we repeated the same analysis after regressing out age and sex and their interaction as confounds.Appendix Figure 8 shows the corresponding results using r 2 as the metric of performance.

Predictability of phenotype measures from functional connectivity after regressing the confounds
One of the issues in determining how predictable a phenotype measure is from imaging is the presence of potential confounds, such as participant sex at birth, age, and time of scan, which are to some extent  predictable from imaging data.Given this, a phenotype measure might be predictable from imaging data because a) the confound can be predicted from imaging data, and b) the phenotype measure can be predicted from the confound.To try to minimize the impact of confounding variables on our predictions, we regressed out age, sex, and age×sex interaction, as well as their squares, from each phenotypic measure.Figure 4 illustrates the regression beta coefficients for each phenotype measure in HCP (Figure 4a) and PNC (Figure 4b).The magnitude of the beta coefficient indicates the strength of the relationship between the independent variable and the dependent variable.A list of all the phenotypes used, the acronyms and a short description can be found in Table 1 (HCP) and Table 5.3

(PNC).
In the HCP analysis, we did not see a large dependence between the variables and age, and the highest beta coefficient was observed for "Strength Unadjusted", where there was a high interaction with sex.This aligns with our expectations as HCP is a young adult cohort while PNC is a neurodevelopmental cohort.Note that we included behavioral phenotypes that had previously been adjusted for age by the HCP team (referred to as "age-adjusted behaviors") and the raw unadjusted behaviors.Our initial aim was to compare the beta coefficients between the adjusted and unadjusted variables provided by the HCP consortia.We observed that if we had trained our model without correcting for the confounds, some of the predictive performance would be due to their presence.For the PNC dataset (Figure 4b), the highest beta coefficients are observed between age and Total Correct Response for All Test Trials (PEDT A), Median Response Time for All Test Trials (PADT T), and Total Raw Score (WRAT CR RAW).
We also compared predictive performance for phenotypes in which we regressed out age and sex confounds or not.Regressing out age and sex results in a significant decrease in mean prediction across all phenotypes (before regression: 0.172 ± 0.143 (mean ± standard deviation); after regression 0.096 ± 0.095; t-statistic:-6.37,value:1.75e-07,df:38)

Predictability of SVD latent variables from functional connectivity
As described earlier, SVDs were fit to each resampled training set and used to produce latent phenotype scores for that training set and the fixed validation set.Contrary to our expectations, we found that SVD latent phenotypes were not more predictable than individual phenotype measures (correlation in Figure 2d and Figure 3d; r 2 in Appendix Figure 8d and Figure 9d).We evaluated this by performing a paired t-test to compare the absolute error between predicted and true z-scored values for the most predictable latent factor and the most predictable phenotypes.In both datasets the error was not significantly different between the best phenotype (HCP: phenotype mean error (std):0.890(0.600)/ PNC: phenotype mean error (std): 0.891 (0.654)) and the best latent phenotype (HCP: latent phenotype mean error (std): 1.001 (0.645) / PNC: latent phenotype mean error (std): 0.950 (0.641); HCP: t-statistic=-1.78,p-value=0.078,df=107 / PNC: t-statistic=-1.138,p-value=0.258,df=108).Cumulative explained variance Individual explained variance Fig. 5. Explained variance for each of the latent phenotype variables obtained after applying an SVD to the behavioral variables from HCP (green, left) and PNC (blue, right).In this experiment, the SVD was computed using the training and the validation data.The bar graphs show the variance explained by each latent variable, while the line plots show the cumulative variance explained.For both datasets, a large proportion of the variance is explained by the first two latent phenotypes; however, most latent phenotypes are needed to reach 95% of the variance, as represented by the dashed line in the figure.Note that the x-axis for these two plots are different as HCP has 83 phenotypes and PNC 39.

Low-dimensional representations of phenotype variables and their reliability
One of our goals was to understand the dependency structure between different phenotype measures, as captured through the SVD of the dataset.This analysis is important to determine if the phenotypes can be represented using a smaller set of uncorrelated latent phenotypes.
For both datasets, we can see that while the first two latent phenotypes explain about 30% of the variance, this decays rapidly for successive latent phenotypes (Figure 5).To attain a comprehensive 95% explanation of the variance, a substantial number of latent phenotypes are required.Considering that each dataset uses a diverse array of tests to cover various cognitive domains, we were not surprised to see that many of the components were required to explain 95% of the variance.We also conducted an experiment to examine the reliability of the latent phenotypes, i.e., how many of them would replicate if the SVD was applied to two independent samples drawn from the same population.To that effect, we simulated this situation with an experiment where we repeatedly split the phenotype dataset into two halves (i.e., sub-samples), and independently computed the SVD of each split.To ensure that the order was consistent across all subsample, we first ran an SVD on the combined dataset, saving the latent phenotypes.We then reordered the latent phenotypes in each split to match the order of latent phenotypes in the original dataset based on the Gale-Shapley stable marriage algorithm (Gale and Shapley, 1962) using the correlation between latent phenotype weights as the preferences.The matching procedure was repeated for 1000 random splits of the data.The first five latent phenotypes in each dataset had an inter-split r 2 greater than 0.2 in 95% or more or random splits (Figure 6, Figure 11 in the appendix for correlation).Both datasets also had some reliable low-variance latent phenotypes (69-73 and 83 in HCP, 35-39 in PNC).Outside of these latent phenotypes, the inter-split correlation of the weights was low.We postulate that this could be due to the loadings for remaining factors being driven by sample idiosyncrasies or the remaining relationships between phenotype measures being weak enough that noise can perturb their estimation.The predictability analysis reveals that the first three HCP latent phenotypes and the fifth HCP latent phenotype, along with the first PNC component, exhibit a correlation mean between true and predicted values that is above 0.15 (Figures 2 and 3).That predictability decays quickly beyond them with an increase in the prediction performance for a few of the last components that do not explain much variance.

Predictability of original phenotypes after training on phenotypes reconstructed from latent space
The previous sections investigated the predictability of individual phenotypes measures, and of latent phenotypes derived from them.We saw that only the first five latent phenotypes in each dataset were reliable (the first 5 components explain 40% of the variance in HCP and 45% on PNC), but explaining 95% of the variance would require most of the latent phenotypes (53 on the HCP and 28 in PNC).If we assume that variance explained by unreliable latent phenotypes will not be predictable from functional connectivity data, then these results suggest the intriguing possibility that most of the variance present across phenotype measures might not be predictable from functional connectivity data (at least using linear prediction models in a moderately sized sample of about 1000 subjects for both HCP and PNC).We can test this by comparing the predictive performance of models trained using progressively larger proportions of the variance in the original phenotypes to the performance of models trained on the original phenotypes themselves.If a model trained to predict a reduced variance representation of the phenotypes does as well as models trained with the original phenotypes, then this will indicate that that reduced variance is the maximally predictable portion of the variance.
To test this possibility, we took advantage of the fact that the SVD of a dataset can be used to generate the best possible rank k approximation of that dataset, in the least squares sense ( Appendix, section 5.2).This corresponds to reconstructing each phenotype measure by considering only those relationships to all other phenotype measures that explain the most variance.Specifically, we used the first five latent phenotypes, as those were reliability identified (Figure 6), to produce rank-1 through rank-5 approximate reconstructions of the phenotype measures in each resampling training set.We then evaluated models trained to predict the reconstructed phenotypes against the true, original phenotypes in the validation set.For both datasets, the reconstructed prediction of rank-1 to rank-5 was very similar to that obtained using the original phenotypes.(PNC: Figure 3c, HCP: Figure 2c).
To perform a quantitative comparison between the 5 reconstruction models and the baseline one, we performed a critical difference analysis (Demšar, 2006).This method is used to statistically compare the prediction performance of several models across many different prediction problems (phenotype measures, in this case), to determine which subsets have significant differences in performance relative to other subsets.In the case of HCP data, the post-hoc Tukey HSD test revealed a significant decline in performance when employing a single latent phenotype compared to all other groups (Figure 12a).Nevertheless, there was no statistically significant difference in performance across all measures when employing 2, 3, 4, 5, or all latent phenotypes.For the PNC dataset, there were no significant differences in performance between utilizing all latent phenotypes and any of the latent phenotypes.

Discussion
Phenotype measures are predictable in both HCP and PNC.In this study, we examined prediction of behavioral phenotypes from functional connectivity data.Several studies have trained different kinds of models to predict a variety of phenotype targets, but have mainly focused on one dataset (Chen et al., 2023;Bertolero and Bassett, 2020) or have used PCA with varimax transformation for dimensionality reduction (Ooi et al., 2022).As a first experiment, we tested phenotype prediction from functional connectivity in both the Human Connectome Project (HCP) and Philadelphia Neurodevelopmental Cohort (PNC) datasets.We found that phenotypes can be predicted from resting state functional connectivity in both datasets (Figure 2b and Figure 3b).Similar to other studies, we observed correlations between predicted and true values in the range of 0.1-0.4 for the majority of phenotypes.For example, the top three phenotypes that could be predicted with the highest performance were Working memory score across all memory tasks (WM Task ACC), 2-min walk endurance test (Endurance Unadj), Attention problems (ASR ATTN T) in the HCP (Figure 2b) and Professional Verbal Reasoning Test Total Correct Responses for All Test Trials (PVRT CR) Wide Range Achievement Test's Wide Range Assessment Test 4 Total Raw Score (WRAT CR RAW) and Primary Mental Abilities total correct response (PMAT CR) in the PNC dataset (Figure 3b).
Regression of the confounds is important.Studies predicting phenotypes from imaging data do not always control for demographic confounds in their analyses (Chyzhyk et al., 2022).While this may make sense in some cases (as we discuss below), failure to control for these confounding variables can inflate estimates of how predictable some phenotypes are if those phenotypes are correlated with the demographic confounds.To evaluate the degree to which this could be an issue, we considered the two most common potential confounds: age and sex.We regressed age, sex, age squared and the interactions of age and age squared with sex from our prediction targets -phenotype measures -and evaluated differences in predictive performance.This showed that confounds can inflate the prediction results if not accounted for; this is particularly visible, for instance, in the relation between the phenotypic variable "strength unadjusted" and sex in the HCP dataset (r = 0.6 before; r = 0.1 after adjustment; Figure 2a and Figure 2b).It is also important to note that it is essential to handle confounding separately for the training and test data.Otherwise, attempting to control for these variables across the entire dataset can undermine the independence of the training and testing data (Chyzhyk et al., 2022).
Finally, it is worth noting that it may be useful to keep the confounding variables in either the prediction targets or the imaging data, depending on the purpose of the prediction model.One example would be age, which is sometimes regressed out as a confounding variable, and sometimes added as an additional predictor to explain part of the variance.For example, in the context of predicting Alzheimer's disease, not regressing age out of imaging might help prediction models that can consider combinations of age and other features.It is still an open question, and left as an additional researcher's degree of freedom, if the confounds should be removed from the image, targets, or both.In this manuscript, we explored the effects of removing the confounds from the phenotypes but future work should consider regressing the confounds from the images.SVD latent variables are not more predictable than individual phenotype measures.Many of the phenotype measures collected in large batteries are correlated to some extent.As discussed earlier, dimensionality reduction techniques can be applied to transform phenotype measures into a latent variable representation.These variables are usually uncorrelated or independent and, informally, explain different aspects of the phenotype measures.Beyond the scientific reasons for generating them, latent variables may also be cleaner than individual phenotype measures, if the individual measures are taken to be noisy measurements of an underlying construct.And, finally, reducing dimensionality also reduces the number of prediction targets to consider.
Several studies have constructed uncorrelated latent variable representations, with each variable explaining different aspects of the phenotypes (Schöttner et al., 2023;Ooi et al., 2022;Chen et al., 2022).For example, Ooi et al. (2022) computes a factor analysis of the behavioral scores and predicts the latent phenotypes instead of the raw phenotypes.They observed that latent phenotypes, in particular the first three, are more predictable than using individual measures and that functional connectivity outperforms other modalities to predict behavioral phenotypes.In this study, we also transformed the phenotype measures into a latent variable representation, and used those latent variables as targets for prediction experiments.Figure 2d and Figure 3d illustrates that even though the first components have the highest prediction performance, the obtained performance is not higher than that obtained for the untransformed phenotype measures (Figure 2b and Figure 3b).One important point is to make a distinction between predictability and reliability, as a perfectly reliable phenotype might still not be predictable or, even worse, might not be related to the phenomena being studied.Finn and Rosenberg (2021) illustrates this point by making an analogy between functional connectivity (FC) and barcodes.Barcodes are unique patterns, but they have no intrinsic connection to any noteworthy traits of the items they label.Therefore barcodes can have excellent accuracy for identification purposes while offering limited value in predicting behavior.In simpler terms, FC might be one-of-a-kind yet fail to provide meaningful insights (high reliability but low validity).
The majority of SVD latent variables are needed to explain 95% of phenotype variance, but only the first few can be reliably estimated.Phenotype prediction is a challenging task, with performance often being negligible, in terms of r 2 , if not undistinguishable from chance level.Ooi et al. (2022) showed that latent variables derived from phenotypes using factor analysis could be more predictable than using the untransformed phenotypes in the HCP and ABCD datasets.We wanted to assess if this finding would replicate with a different dimensionality reduction algorithm, and generalize to other datasets.
The present study employed SVD because it is the most efficient approach to producing low-rank data approximations.The questionnaires used to produce the phenotype measures in each study are chosen to span a range of domains of cognition.However, it is quite likely that each phenotype measure draws from more than one latent construct, leading to a complex correlation structure between them.This correlation structure -identified with SVD, PCA, or FA -allows robust identification of latent variables, and has been used to guarantee the performance of transfer learning algorithms (He et al., 2022).The resulting latent variables may also be more interpretable because they are uncorrelated.
For both datasets, the majority of the latent variables were required to account for 95% of the variance; at the same time, the first variables explain more variance than most others (Figure 5).Because of this behavior, we hypothesised that the predictability of each latent variable would be correlated with how much variance it explains.As discussed above and visible from Figure 2d, the observed phenomenon is more complicated than that.Over the first few latent variables, there was no clear relationship between the amount of variance explained and predictability.This is particularly visible for the HCP data, where the first latent variable, which explains 16% of the variance, was not the most predictable one.At the same time, predictability decays rapidly after the first few latent variables (Figure 2d).This latter finding prompted us to study the reproducibility of the loadings associated with each latent variable.To do this, we used the Gale-Shapley stable marriage algorithm (Gale and Shapley, 1962), where we used the correlation to match latent phenotypes across different splits.After further investigation, we observed that, for both datasets, only the loadings for the first five components were reliably identified on 1000 repetitions (Figure 6), and reliability quickly decays for the remaining factors.We think that due to the mathematical properties of the SVD (that requires all components to be orthonormal), we see an increase of reliability in the low-variance phenotypes again.The increase in reliability for the first components might be related to their increased predictability compared to the other latent phenotypes.Further highlighting this complex relation between predictability and reliability, we also noticed that, in the HCP dataset, some of the variables explaining the least variance are most predictable.This experiment is important as it showcases that, beyond five latent phenotypes, there would be no guarantee of finding the same latent phenotypes in different samples.If we cannot even ensure reliable splits within the same dataset, our ability to uncover relationships in out-of-sample associations becomes increasingly uncertain.Our results raise an interesting question: why do latent features beyond the fifth one become unreliable, and why do they explain so little variance?One simple guess would be that starting from the fifth component, what is predominantly being captured is either noise or variance that occurs on very few subjects, as opposed to meaningful, structured information inherent in the data.
A low-dimensional representation of phenotype variables has similar predictability to the original phenotypes.The unreliability from the fifth latent phenotype onward prompted us to question how much information could be predicted using only the reliable latent variables.Figure 2c and Figure 3c demonstrate that only a few latent variables are necessary to adequately reconstruct the training set phenotype measures, and achieve similar performance on the validation set as a model trained using the original phenotype measures.This finding suggests that relations between phenotype measures that are present across many participants may be captured primarily on the first few latent variables.Therefore, SVD could serve as a denoising algorithm, providing means to work with fewer latent variables in training but still yielding as good or better prediction performance as using the original phenotype measures.
The relationship between noise and reliability and predictive models is crucial but, while the importance of sample size for statistical power is widely acknowledged, the consideration of measurement reliability when estimating the required sample sizes remains an under-addressed aspect (Zuo et al., 2019).Following this logic, Gell et al. (2023); Nikolaidis et al. (2022) propose that focusing on the reliability of phenotypic measurements during target selection and choosing only those with high reliability might improve the performance of brain-behavior models.However, due to the notable variability in reliability, the proposed sample size requirements may not universally apply (Rosenberg and Finn, 2022;Chen et al., 2023).As a significant consequence, when the reliability of a measurement diminishes, a larger number of participants will be needed to accurately identify a correlation between two measures (Zuo et al., 2019).Some studies have found that a larger sample size helps obtain a higher prediction performance, but carefully choosing reliable targets has a bigger impact on the model's performance (Gell et al., 2023;Nikolaidis et al., 2022).One remaining question is whether using low-rank SVD reconstructions of the phenotype measures lessens the sample sizes needed to obtain a given prediction performance.This could be tested by running resampling experiments with smaller sample sizes, but this is beyond the scope of the present paper.
In this paper, we did not assess the reliability of the features that are important for prediction.Many works that trained brain-behavior models use a variety of interpretation techniques -e.g., feature weights or saliency maps -to analyze which brain regions were more relevant for prediction (Finn et al., 2015;Jiang et al., 2020;Chen et al., 2022).While reliability of the prediction targets has not been the focus of some of these studies (Finn et al., 2015;Jiang et al., 2020;Chen et al., 2022)., the feature weights of models frequently show moderate to low test-retest reliability as well.Surprisingly, a linear transformation to these feature weights (i.e., Haufe-transformed weights (Haufe et al., 2014)) demonstrated greater reliability compared to untransformed weights (Tian and Zalesky, 2021;Chen et al., 2023).

Limitations of the current work
To overcome the scarcity of data, a few studies have used transfer learning to improve the performance of brain-behavior models (Holderrieth et al., 2022;He et al., 2022;Chopra et al., 2022;Wulan et al., 2024) and shown that information from large datasets with rich phenotypic data can be used to improve performance on smaller datasets.However, there are still many implementation details that need to be further investigated before being able to transfer phenotype models from large to small datasets successfully.Generalizability, the broad applicability of a model, is closely linked to the concept of transfer learning.The main difference is that while generalizability refers to the model's ability to perform well on new, unseen data, in transfer learning, a model predicting a phenotype is first trained on a large dataset and then the model is refined to improve its performance on specific smaller datasets, combining the general information learned from the larger dataset with information specific to the smaller dataset.A few studies have evaluated the generalizability of models trained on one dataset to other datasets, for a variety of brain-behavioral phenotypes, and have shown that fluid intelligence is one of the few targets for which models have shown moderate performance when tested on new datasets (Wu et al., 2022;Tong et al., 2022).In this study, we did not explore transfer learning, as we explored the idea shown by previous research (Nikolaidis et al., 2022;Gell et al., 2023) that by using more reliable prediction targets (in our case obtained from latent components), we could improve the model prediction but as we have seen the using latent phenotypes did not improve the performance.
When pre-processing the phenotypes, we z-scored all phenotypes, set values that were above and below 3 standard deviations to zeros and treated those values as missing by the imputation algorithm.This choice of how to deal with outliers could have an impact on our analysis and was one of the possible researcher degrees of freedom in our analysis.In particular, by making this choice, we are eliminating the most severe scores.A possible alternative would have been to use winsorizing instead of treating the severe scores as missing.Another limitation of our preprocessing stems from the fact that we used different processing pipelines from HCP and PNC.While for the HCP we were using the already pre-processed data for HCP, we pre-processed the data using fMRIPrep.Because of this choice, we had additional information about motion parameters for PNC and used this information to censor time points that had excessive motion, but we did not censor time points with excessive motion in HCP.
Another limitation of the current work stems from our reliance on a ridge regression model, as this can only capture a linear relationship between the functional connectivity data used as input and the phenotype measures used as prediction targets.It is still debated if the brain-behavior relationship can be better predicted using linear, and therefore more explainable models, or benefit from the nonlinear models.While Bertolero and Bassett (2020) showed that deep neural networks can model simple and complex relationships between brain and behavior, He et al. (2020) defended that deep neural networks and kernel regression yield similar levels of accuracy when it comes to predicting behavior and demographics through functional connectivity analysis.On the other hand, Pervaiz et al. (2020) and Schulz et al. (2020) showed that simple linear models perform very similarly to more complex ones.The second limitation is the use of a linear method (SVD) to derive latent variables from the phenotype measures.SVD captures covariance relationships, a linear measure of association between those measures.It is possible that there are also non-linear relationships between the phenotypes, which could be identified with non-linear dimensionality reduction approaches such as auto-encoders.Here, we chose to use both a linear model and a linear dimensionality reduction approach to set a baseline for future analyses and facilitate comparison with previous work.
In conclusion, we showed the importance of controlling for age and sex confounds when creating a brain-behavior model.Failure to remove them could lead to artificially inflated prediction results.Our main contribution is showing that, by reconstructing phenotype variables using the first few latent SVD variables, we could obtain a very similar performance training a model on the original variables.This suggests that most of the information about phenotype that can be identified using a linear model trained on functional connectivity data is present only in the first latent phenotypes.We suggest that future research should further explore the usage of latent variables derived from phenotypes.
DM: design the study, data curation, contributed to the manuscript EE: data curation, contributed to the manuscript GL: methodology feedback, contributed to the manuscript AGT: design the study, contributed to the manuscript PM: design the study, contributed to the manuscript FP: data curation, design the study, contributed to the manuscript Data and Code availability All code is available at https://github.com/JessyD/brain-phenotypes.There the user can also find the lists of participants' IDs and behavior scores utilized for both datasets.Both PNC and HCP are publicly available and can be obtained after acceptance of their respective data agreement.
Preprocessing of B0 inhomogeneity mappings A total of 1 fieldmaps were found available within the input BIDS structure.A B0 nonuniformity map (or fieldmap) was estimated from the phase-drift map(s) measure with two consecutive GRE (gradient-recalled echo) acquisitions.The corresponding phase-map(s) were phase-unwrapped with prelude (FSL 6.0.5.1:57b01774).A total of 1 T1-weighted (T1w) images were found within the input BIDS dataset.The T1-weighted (T1w) image was corrected for intensity non-uniformity (INU) with N4BiasFieldCorrection (Tustison et al., 2010), distributed with ANTs 2.3.3 (Avants et al., 2008, RRID:SCR 004757), and used as T1w-reference throughout the workflow.The T1w-reference was then skull-stripped with a Nipype implementation of the antsBrainExtraction.sh workflow (from ANTs), using OASIS30ANTs as target template.Brain tissue segmentation of cerebrospinal fluid (CSF), white-matter (WM) and gray-matter (GM) was performed on the brain-extracted T1w using fast Zhang et al. (2001).Brain surfaces were reconstructed using recon-all (FreeSurfer 6.0.1, RRID:SCR 001847, Dale et al., 1999), and the brain mask estimated previously was refined with a custom variation of the method to reconcile ANTs-derived and FreeSurfer-derived segmentations of the cortical gray-matter of Mindboggle (RRID:SCR 002438, Klein et al., 2017 Functional data preprocessing For each of the 3 BOLD runs found per subject (across all tasks and sessions), the following preprocessing was performed.First, a reference volume and its skull-stripped version were generated using a custom methodology of fMRIPrep.Head-motion parameters with respect to the BOLD reference (transformation matrices, and six corresponding rotation and translation parameters) are estimated before any spatiotemporal filtering using mcflirt (FSL 6.0.5.1:57b01774,Jenkinson et al., 2002).BOLD runs were slice-time corrected to 1.47s (0.5 of slice acquisition range 0s-2.94s) using 3dTshift from AFNI (Cox and Hyde, 1997, RRID:SCR 005927).The BOLD timeseries (including slice-timing correction when applied) were resampled onto their original, native space by applying the transforms to correct for head-motion.These resampled BOLD time-series will be referred to as preprocessed BOLD in original space, or just preprocessed BOLD.The BOLD reference was then co-registered to the T1w reference using bbregister (FreeSurfer) which implements boundary-based registration (Greve and Fischl, 2009).Co-registration was configured with six degrees of freedom.Several confounding time-series were calculated based on the preprocessed BOLD: framewise displacement (FD), DVARS and three region-wise global signals.FD was computed using two formulations following Power (absolute sum of relative motions, Power et al. (2014)) and Jenkinson (relative root mean square displacement between affines, Jenkinson et al. (2002)).FD and DVARS are calculated for each functional run, both using their implementations in Nipype (following the definitions by Power et al., 2014).The three global signals are extracted within the CSF, the WM, and the whole-brain masks.Additionally, a set of physiological regressors were extracted to allow for component-based noise correction (CompCor, Behzadi et al., 2007).Principal components are estimated after high-pass filtering the preprocessed BOLD time-series (using a discrete cosine filter with 128s cut-off) for the two CompCor variants: temporal (tCompCor) and anatomical (aCompCor).tCompCor components are then calculated from the top 2% variable voxels within the brain mask.For aCompCor, three probabilistic masks (CSF, WM and combined CSF+WM) are generated in anatomical space.The implementation differs from that of Behzadi et al. in that instead of eroding the masks by 2 pixels on BOLD space, the aCompCor masks are subtracted a mask of pixels that likely contain a volume fraction of GM.This mask is obtained by dilating a GM mask extracted from the FreeSurfer's aseg segmentation, and it ensures components are not extracted from voxels containing a minimal fraction of GM.Finally, these masks are resampled into BOLD space and binarized by thresholding at 0.99 (as in the original implementation).Components are also calculated separately within the WM and CSF masks.For each CompCor decomposition, the k components with the largest singular values are retained, such that the retained components' time series are sufficient to explain 50 percent of variance across the nuisance mask (CSF, WM, combined, or temporal).The remaining components are dropped from consideration.The head-motion estimates calculated in the correction step were also placed within the corresponding confounds file.The confound time series derived from head motion estimates and global signals were expanded with the inclusion of temporal derivatives and quadratic terms for each (Satterthwaite et al., 2013).Frames that exceeded a threshold of 0.5 mm FD or 1.5 standardised DVARS were annotated as motion outliers.The BOLD time-series were resampled into standard space, generating a preprocessed BOLD run in MNI152NLin2009cAsym space.First, a reference volume and its skull-stripped version were generated using a custom methodology of fMRIPrep.The BOLD time-series were resampled onto the following surfaces (FreeSurfer reconstruction nomenclature): fsaverage6, fsaverage.Automatic removal of motion artifacts using independent component analysis (ICA-AROMA, Pruim et al., 2015) was performed on the preprocessed BOLD on MNI space time-series after removal of non-steady state volumes and spatial smoothing with an isotropic, Gaussian kernel of 6mm FWHM (full-width half-maximum).Corresponding "non-aggresively" denoised runs were produced after such smoothing.Additionally, the "aggressive" noise-regressors were collected and placed in the corresponding confounds file.Grayordinates files (Glasser et al., 2013b) containing 91k samples were also generated using the highest-resolution fsaverage as intermediate standardized surface space.All resamplings can be performed with a single interpolation step by composing all the pertinent transformations (i.e.head-motion transform matrices, susceptibility distortion correction when available, and coregistrations to anatomical and output spaces).Gridded (volumetric) resamplings were performed using antsApplyTransforms (ANTs), configured with Lanczos interpolation to minimize the smoothing effects of other kernels (Lanczos, 1964).Non-gridded (surface) resamplings were performed using mri vol2surf (FreeSurfer).
Many internal operations of fMRIPrep use Nilearn 0.8.1 (Abraham et al., 2014, RRID:SCR 001362), mostly within the functional processing workflow.For more details of the pipeline, see the section corresponding to workflows in fMRIPrep's documentation.

Singular Value Decomposition (SVD)
Definition Given a n × d data matrix X, the SVD of X is where U is n × k, S is k × k, and V is d × k, for a given number of latent variables k.The matrices U and V are called the left and right singular vector matrices (the columns are the singular vectors), and S is the singular value matrix (the diagonal contains the singular values).
The matrices U , S, and V have various useful properties: -U and V are orthonormal: their columns have norm 1, and are orthogonal to each other.
-As orthonormal matrices, U and V can be viewed as bases for the spaces spanned by columns or rows of X, respectively, and where coordinates for X would be SV ′ or U S. -S is a diagonal matrix, with different values in each entry of the diagonal.
k can only be as large as min(n, d); if X is of rank lower than that -which can happen if columns or rows are linearly dependent -it can be smaller.
Dimensionality reduction with SVD In the dimensionality terminology used in the paper, U S is the latent variable matrix, and V ′ is the mixing matrix.The variables in the low-dimensional space are uncorrelated, which makes them suitable for various applications (e.g., visualization, and preprocessing for other methods that assume uncorrelated variables).
In addition, the matrix V can be used both to convert X (or other examples) in the d-dimensional space into the k-dimensional space, as well as convert any examples in the low-dimensional space back into the d-dimensional space by multiplying them by V ′ .
Figure 7 (top) shows how each individual entry x i,j in matrix X is reconstructable as a multiplication of the i th row of U (u i,: ), scaled by the entries in the diagonal of S, and by the j th column of V ′ .
An essential step to take into consideration before applying SVD is normalization as (1) it ensures scale-invariance in the SVD results, preventing larger-scale variables from dominating the analysis and distorting the interpretation of relationships between variables; (2) without normalization, it becomes challenging to understand the relative importance and contributions of variables due to varying scales, making the interpretation less intuitive; (3) it enhances the numerical stability of SVD computations, reducing the risk of numerical errors that can arise when data scales vary widely.
. Top: The support vector decomposition of matrix X as a product of orthonormal matrix U , diagonal matrix S, and the transpose of orthonormal matrix V .Each entry xij in matrix X is reconstructable as a multiplication of the i th row of U (ui,:, scaled by the entries in the diagonal of S, by the j th column of V ′ .Bottom: The matrix X can be expressed as a sum of rank 1 matrices R1, . . ., R k , where R l is the outer product u :,l ⊗ v ′ :,l multiplied by the scalar s l,l .
Relationship between SVD and the data matrix SVD provides the best approximation of the matrix X at any specified rank (i.e., one can take the full SVD and expand it to show that it reconstructs X as a sum of k rank 1 matrices, in other words, R 1 , . . ., R k : as shown in Figure 7 (bottom)).Each of these matrices explains some of the variance in X.If we sort the dimensions of the SVD from those that explain the most to those that explain the least, adding k matrices will yield the best rank k approximation of X, in terms of minimizing squared error.
If we take the diagonal of the matrix S, s = diag(S), then the percentage of variance accounted for by dimension i is . Adding up the variance for dimensions 1 − i will give you the total variance explained by the rank i approximation.This is useful in data compression (best approximation), denoising (many dimensions with little variance are capturing noise), or matrix completion (if there are missing entries, the SVD reconstruction can be a good multivariate imputation method, subject to certain conditions).All of this results from the fact that, if minimizing squared error to the original matrix X, the SVD with k latent phenotypes is the best rank k approximation of X.
Relationship between PCA and SVD To prove the relationship between PCA and SVD, we assume that the dataset X has been centered, i.e.where each column has mean 0. The covariance matrix for a centered dataset X is where E is a d × d matrix, and Λ is a diagonal matrix.The E is an orthonormal matrix (column vectors are norm 1, and orthogonal to each other), where columns are called eigenvectors.The diagonal of Λ, λ = diag(Λ) contains the corresponding eigenvalues.The matrix E acts as a basis for reconstructing the covariance matrix E. The fraction of variance explained by eigenvector i is λi d j=1 λj .The relationship between SVD and PCA becomes clear if we consider that C can be expressed in terms of the SVD matrices, as follows so the eigenvector matrix E is the same as the right singular vector matrix V (with some caveats if X is not full rank).

List of all the phenotypes scores used from HCP and PNC
The tables below contain the variable names used in our study, the description of what it refers to, and which category the questionnaire comes from.

Variable Description Category
AngAffect Unadj*  5.6 How many latent phenotypes do we need to have an accurate prediction?
Fig. 11.Marriage results using the inter-split correlation.In contrast to the plot in the main text, where we report the r 2 , here we report the inter-split correlation.
PNC HCP Fig. 12. Critical difference diagrams for HCP (left) and PNC (right).For the HCP data, the post-hoc Tukey HSD showed that using just one latent phenotype had a significantly worse performance than all other groups.There is, however, no significant difference in using all or a subset of latent phenotypes for the PNC dataset Hypotheses For each of the research questions listed in the previous section, provide one or more specific and testable hypothesis.Please make clear whether the hypotheses are directional (e.g., A > B) or non-directional (e.g., A ̸ = B).If directional, state the direction.You may also provide a rationale for each hypothesis.
1. How does controlling for age/sex confounds impact the phenotypes prediction?
-Hypothesis: In accordance with the literature (Chyzhyk et al., 2022) and the results we obtained on our exploratory analysis (Figure 4), failure to remove age and sex confounds will lead to a higher correlation coefficient for variables that are strongly correlated with age/sex.Accordingly, we expect that predictive performance will be lower after regressing age/sex out of the phenotypes.Though our expectation is directional, we will test this with a two-sided paired t-test on the predictive performance with and without regressing out age and sex.
-Rationale: Information from sex and age can drive the predictability of specific phenotypes.If not regressed out, those confounds can inflate the predictability of some phenotypes.This is particularly visible in the variable 'Strength Unadjusted' from our exploratory analysis of the HCP dataset.If sex is not regressed out, it is the most predictable variable (Figure 2a), but most of its predictability is given by the sex information.2. How reliable are the latent phenotypes estimated with SVD across different samples?
-Hypothesis: Only the first five latent phenotypes (in order of explained variance) will be reliable across different samples, with subsequent latent phenotypes showing decreased reliability.For a complete description of how our reliability metric was computed, see section 2.7.We specifically define a reliable component as having an empirical lower 95% confidence interval on the r 2 greater than 0.2.The empirical confidence interval will be calculated from 1,000 random splits of the data.
-Rationale: Our previous analyses of the HCP and PNC datasets have demonstrated that while the first few latent phenotypes are reliably identified across different samples, the reliability diminishes for subsequent latent phenotypes.We anticipate a similar pattern in the HBN dataset, where the initial latent phenotypes should be reliably identified across different dataset splits.3. Can we obtain more predictable phenotypes if we use a linear transformation to latent phenotypes (SVD) on HBN? -Hypothesis: Applying a linear dimensionality reduction (SVD) to obtain latent phenotypes on the HBN dataset will not decrease the predictability compared to directly using phenotypes.We will evaluate their predictability by comparing the correlation of the prediction for the untransformed phenotypes and those transformed using SVD with a two-sided paired t-test.Again our expectation is directional but we are using a two-sided case because it will still be of interest if either the latent or the original phenotypes have significantly greater performance.
-Rationale: Given our exploratory analysis with HCP and PNC where we did not observe an increase in predictability by using latent phenotypes, in comparison with the original phenotype variables, we do not expect to see that the former will be more predictable.4. Are phenotype variables reconstructed from the SVD latent variables as informative for training a model as the original phenotype variables?-Hypothesis: The reconstruction of phenotypes using a subset of latent phenotypes, particularly the first few components, will exhibit similar or enhanced predictability compared to using all latent phenotypes.To evaluate if there is any statistical difference in the performance of the prediction algorithms on using all of the components or a subset thereof, we will use the autorank library (Herbold, 2020;Demšar, 2006).See sections 2.5 and 3.5 for a more detailed description.
-Rationale: Linear transformations such as SVD aim to capture the underlying structure within complex datasets by creating uncorrelated variables (latent phenotypes) that explain as much variance as possible.Conversely, the original data can be reconstructed from the most informative latent phenotypes, retaining essential information while discarding redundant or noisy signals.If used to denoise the phenotype variables in the training set of prediction models, the denoising can potentially lead to improved predictive performance.Therefore, we expect that prediction performance of models trained on reconstructed phenotypes will not decrease and may even surpass that of models trained using the original phenotype variables.

Data Description
Datasets used Name and briefly describe the dataset(s), and if applicable, the subsets of the data you plan to use.Useful information to include here is the type of data (e.g., cross-sectional or longitudinal), the general content of the questions, and some details about the respondents.In the case of longitudinal data, information about the survey's waves is useful as well.Mention the most relevant information so that readers do not have to search for the information themselves.
In the first part of this manuscript, we conducted an exploratory analysis using the HCP (Van Essen et al., 2012) and PNC dataset (Satterthwaite et al., 2016), for our confirmatory analysis, described in this pre-registration, we will use the Healthy Brain Network (HBN) dataset (Alexander et al., 2017).While the HCP dataset includes young adults aged between 22 and 35, the PNC dataset includes children aged 8 to 21.Both PNC and HBN are neurodevelopmental cohorts acquired in the Philadelphia and New York metropolitan areas, respectively.The HBN dataset contains neuroimaging and phenotypic data gathered across a diverse population of children and adolescents (ages 5-21).The dataset includes, among other modalities, functional MRI, diffusion MRI and electroencephalography collected longitudinally.It also includes extensive behavioral assessments, cognitive tests and psychiatric diagnostics.For this analysis, we will use the resting state MRI images and cognitive phenotypes, and only used cross-sectional information.The HBN data was acquired in 3 sites, using different scanner manufacturers, a number of time points and length of scans; an overview of the MRI protocol can be found via the NITRC website (Kennedy et al., 2016).
on exploratory factor analysis (EFA) or confirmatory factor analysis (CFA), also specify the relevant details (EFA: rotation, how the number of factors will be determined, how best fit will be selected, CFA: how loadings will be specified, how fit will be assessed, which residuals variance terms will be correlated).If you are using any categorical variables, state how you will code them in the statistical analyses.
We will use a subset of the Healthy Brain Network (HBN) dataset, comprising 895 subjects.The dataset consists of a training set with 672 individuals (386 males and 286 females) with an average age of 12.40 ± 3.60 years, and a separate test set containing 105 subjects (66 males and 39 females) with an average age of 12.32 ± 3.60 years, and a hold-out set containing 114 subjects (67 males and 47 females) with an average age of 12.52 ± 3.76 years.Subjects were assigned to the training, test, and hold-out sets in a stratified sampling procedure.Subjects were sorted by age and sex at birth, and randomly within each sex.They were then assigned in a proportional round-robin fashion to training, test, and hold-out datasets.Although the HBN dataset contains 11 diagnostic labels, we opted not to use this information for splitting the dataset, as some conditions are represented by very few subjects.Additionally, incorporating extra criteria for comorbidity would have been necessary, complicating the dataset division process.Also, note that according to the pre-processing notation we adopted for all datasets, we encoded female as 1 and male as 2; however, when regressing out the covariates, we encoded male as 1 and female as 0.
A list with the subject IDs for the train, test, and hold-out dataset will be provided on the gitrepo together with the code for the analyses (https://github.com/JessyD/brain-phenotypes)and on the OSF description of this project.

Imaging data
We have already preprocessed the HBN imaging data with the same preprocessing described in our preprint for the PNC dataset.The preprocessing of the HBN dataset was conducted using fmriprep version 21.0.2.Motion censoring was performed at the individual time point level, with points exceeding 0.2 mm frame-wise displacement or a derivative root mean square (DVARS) above 75 marked for censoring.Intervals of less than five points between censor points were also flagged for censoring.The six estimated head-motion parameters, their derivatives, and average signals within anatomically-derived white matter and cerebrospinal fluid masks obtained from fmriprep, were regressed from the functional magnetic resonance imaging (fMRI) signal prior to functional connectivity (FC) computation.We generated functional connectivity matrices by calculating Pearson correlations among the average time series of each brain region pair.The regions were delineated using the Schaefer 400 parcellation, which consists of 17 networks (Schaefer et al., 2018).If any of the runs had more than 50% of the runs flagged as time points to be censored, we do not use that run for computing the FC.Some subjects from the HBN dataset contain more than one resting state runs, if more than one run is available for a subject, we will use an average of the computed functional connectivity.For a complete description of how the functional connectivity matrix will be computed, please refer to section 2.2.
Phenotypes While the HBN dataset encompasses both behavioral and clinical phenotypes, our analysis focused exclusively on behavioral measures.We specifically selected cognitive phenotypes that cover constructs similar to those found in the HCP and PNC datasets.For each specific instrument, we selected the summary metric it provided, rather than individual answers or subscales, to ensure consistency with the other datasets in our analysis.Ultimately, while the original phenotypes cover behavioral, personality, and mental health domains, we curated a set of 29 behavioral scores (see section 3 for a full list).As a final check to identify problems or redundancy, we also calculated the correlation between the different phenotypes selected (Figure 13).As described below, missing phenotypes or phenotypes that are above and below 3 standard deviations will be treated as missing data and imputed.All phenotype values will be z-scored to be on a similar scale and avoid scores that are higher to dominate the model prediction.For a complete description, please refer to section 2.1 and section 2.1, where we describe how the behavioral phenotypes are preprocessed for the HCP and PNC datasets.
Unit of analysis Which units of analysis (respondents, cases, etc.) will be included or excluded in your study?Taking these inclusion and exclusion criteria into account, indicate the expected sample size of the data you'll be using for your statistical analyses.If you have a research question about a certain group you may need to exclude participants based on one or more characteristics.Be very specific when describing these characteristics so that readers will be able to redo your moves easily.Subjects were chosen if their imaging preprocessing pipeline was completed without error and if any of the runs had more than 50% of the time points flagged as points to be censured (due to the recessive motion), we do not use that run for computing the FC.The numbers described in section 6.7, describe the dataset that will be used for this confirmatory analysis.Missing phenotypes were not used as criteria for exclusion, as we imputed the missing phenotype values.Missing data What do you know about missing data in the dataset (i.e., overall missingness rate, information about differential dropout)?How will you deal with incomplete or missing data?Provide descriptive information, if available, on the amount of missing data for each variable you will use in the statistical analyses.Based on this information, provide a new expected sample size.Similar to the analysis described in our preprint for the HCP and PNC dataset (section 2.1 and 2.1, each phenotype measure underwent z-scoring across the phenotypic measures.We identified missing values and removed outliers (i.e., variables that were above/below 3 standard deviations for that phenotype), and imputed both using the IterativeImputer function from the scikit-learn library (Pedregosa et al., 2011).The main idea behind the IterativeImputer is to model each missing value in one variable as a function of all the other variables available for a subject that is present.The algorithm works iteratively in a round-robin matter, where during each iteration, one phenotype is chosen as (y), and the remaining features are used as input to the imputation algorithm.Subjects that had no phenotypical information were excluded; however, if one phenotype was present, all the remaining values were imputed.
Sampling weights Are there sampling weights available with this dataset?If so, are you using them or are you using your own sampling weights?Sampling weights can be useful in secondary data analysis because the sample may not be entirely representative of the population you are interested in.
There are no sampling weights.Models are trained and evaluated on samples from the same population.

Knowledge of Data
Prior Publication/Dissemination List the publications, working papers, and conference presentations you have worked on that are based on the dataset you will use.For each work, list the variables you analyzed, but limit yourself to variables that are relevant to the proposed analysis.If the dataset is longitudinal, also state which wave of the dataset you analyzed.Specify the previous works for each co-author separately.The work described in the main manuscript, in which all co-authors were involved, analysed the predictability of phenotypes from the Human Connectome Project (HCP) and the Philadelphia Neurodevelopmental Cohort (PNC).Building on these findings, we are testing the generalizability of our results by incorporating a third dataset.Therefore, the analyses described in this preregistration will be the same as those described in our preprint (the main part of this document).
Prior knowledge Disclose any prior knowledge you may have about the dataset that is relevant for the proposed analysis.If you do not have any prior knowledge of it, please state so.Your prior knowledge could stem from working with the data first-hand, from reading previously published research, or from codebooks.Provide prior knowledge for every author separately.
All co-authors are familiar with the original publication provided by the HBN consortia where the recruitment criteria and data description are detailed (Alexander et al., 2017).We have already preprocessed the neuroimaging data as described above, selected the phenotypes, and formatted the phenotypic data to be compatible with our analytical pipeline.
Jessica Dafflon has worked with the structural MRI from the HBN.
Dylan M. Nielson has worked with the HBN data before and published a pre-print Validation of CBCL depression scores of adolescents in three independent datasets (Zelenina et al., 2023).Dustin Moraczewski has interacted with the dataset through familiarity with the relevant data papers and other literature, as well as the dataset contents, and has engaged in downloading, preprocessing, quality checking, and preparing the data for collaborators but has not implemented any further analyses.
Eric Earl has interacted with the dataset only to reshape the tabular phentoype data for easier ingestion into the project's analysis (Earl et al., 2023).
Gabriel Loewinger has never worked with the HBN dataset before.
Patrick McClure has never worked with the HBN dataset before.
Adam G. Thomas has familiarity with the dataset via analysis presented in Lam et al. (2022) which has been presented at lab meetings.
Francisco Pereira was involved in a separate study (Lam et al., 2022) where they used the HBN data to test a new method for creating interpretable latent variables.However, in this study, the authors have used only the clinical phenotypes from the HBN.Neither the behavior nor the imaging data which we will use were analyzed.

Analyses
Statistical models For each hypothesis, describe the statistical model you will use to test the hypothesis.Include the type of model (e.g., ANOVA, multiple regression, SEM) and the specification of the 1.Evaluate our previous findings with a third dataset: Our major effort is to assess whether previous findings for PNC and HCP would generalize to an independent dataset.This is why we are pre-registering this analysis where we re-implement the same methods on a separate dataset.2. Assess the reliability of the latent phenotypes: As we mentioned above (Analysis, statistical models), there is no guarantee that if we repeat the SVD on different splits of the data, we would obtain the same results.Therefore, we repeated the analysis on 1,000 different splits and assessed how reliably we could identify the components on different splits (See section Analysis, statistical methods or our preprint for a more detailed description).3. Resampling within datasets: All of our reported results have been conducted using bootstrapping (i.e., we repeated our analysis 100 times with different training datasets resampled from the original one -see section 2.5 for a detailed description).We use these bootstraps to reduce the variance of our predictive performance estimates.
Exploratory analysis If you plan to explore your dataset to look for unexpected differences or relationships, describe those tests here.If reported, add them to the final paper under a heading that clearly differentiates this exploratory part of your study from the confirmatory part.
-How predictable are phenotypes from functional connective in HBN, and how do they compare to the predictability of the other datasets we analyzed?To answer this question we will fit a ridge regression to try to predict phenotypes from resting state functional connectivity.We will check if the obtained predictions are within the same range (r = [0, 0.4]) as those obtained for the other two datasets.For a detailed description of how the ridge regression model will be trained and which variables will be used as input, please refer to section 2.5 of the main manuscript, as the same model architecture (retrained for the HBN dataset) and pre-processing used for HCP and PNC will be used for the new dataset

Fig. 2 .Fig. 3 .Fig. 4 .
Fig. 2. Predictability of phenotype variables in the HCP validation set before regressing out confounds (a), after confound removal (b), reconstructed using only a subset of the latent phenotypes (c), and of the latent phenotypes (d).Predictability is quantified as the correlation between true phenotypes and prediction from functional connectivity data.Panel c shows the correlation of the predicted phenotypes with true values when the model was trained with a subset of latent phenotypes (1, 2, 3, 4, 5 and all, represented by the different shades of green).The shaded regions represent the standard deviation across resamplings.The phenotypes (y-axis) are ordered by the prediction performance, with the exception of (d), which is ordered by the variance explained by the latent phenotypes.(a) Before regressing out, the phenotypes with the best performance is "Strength unadjusted"; after regressing age and sex, cognitive phenotypes have the best performance.In addition, using latent phenotypes (d) or untransformed phenotypes (b) does not impact the performance substantially.

Fig. 6 .
Fig.6.Latent phenotypes reliability for both datasets HCP (in green; left) and PNC (in blue; right).The loadings of the first 5 SVD components were reliably identified within different splits using the Gale-Shapley stable marriage algorithm.The shaded area represents the 95% confidence intervals. 1

Fig. 9 .Fig. 10 .
Fig. 9. Predictability of phenotype variables in the validation set, quantified by the coefficient of determination (r 2 by correlation between predicted and true values as in Figure 3) between predictions and true values, for PNC before regressing the covariates (a), after regression (b), reconstruction (c), and of the latent phenotypes (d).The shaded regions represent the standard deviation across resamplings.

Fig. 13 .
Fig. 13.Pairwise correlation between all the HBN phenotypes that we will include in our analysis

Fig. 14 .
Fig. 14.Pairwise correlation between all the HBN phenotypes.This plot includes some of the variables that we defined as redundant as they include the same information, just slightly modified by any type of normalisation (percentile, scale, sum)

Table 3 .
This table lists the phenotype variables from the Healthy Brain Network (HBN) dataset that will be utilized for analysis.The variables are ordered by the instrument they belong to, marked in bold.